feat: add Qwen3.5 MoE calibration module #2383
Sehyo wants to merge 12 commits into vllm-project:main from
Conversation
Summary of Changes

Hello @Sehyo, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request introduces a specialized calibration module for Qwen3.5 Mixture-of-Experts (MoE) models, designed to facilitate efficient NVFP4 quantization of their expert weights. By dynamically restructuring the MoE block to expose individual expert layers as standard linear modules, it enables the application of fine-grained quantization techniques. A new example script demonstrates this process, ensuring broader compatibility and optimized performance for these large language models.

Highlights
Activity
👋 Hi! Thank you for contributing to llm-compressor. Please add the ready label when the PR is ready for review. Note: this is required to complete the testing suite; please only add the label once the PR is code complete and local testing has been performed.
Code Review
This pull request introduces a calibration module for Qwen3.5 MoE models, enabling NVFP4 quantization. The changes include the core module implementation, its registration within the modeling package, and a comprehensive example script demonstrating its usage on a large-scale model. The implementation correctly unfuses expert weights into individual nn.Linear layers, which is crucial for quantization. The approach of using disable_onloading to handle large model weights on the CPU is well-considered. I have identified one potential issue in the forward pass logic that could lead to errors for MoE models configured with top_k=1, and I have provided a suggestion to address it.
Force-pushed 83c7bd8 to 1d428f9
Requesting review; alternatively, the ready tag and enhancement tag.
Quantized version with this PR:
dsikka left a comment
This looks really good - thank you!
The quality checks have failed. Please run
I keep getting RuntimeError: CUDA error: CUBLAS_STATUS_INVALID_VALUE when calling
Is this an error from vLLM?
I have detected an issue in the current upstream version of vLLM which causes the Qwen3.5 NVFP4 quant to fail. In the Qwen3.5 Gated Delta Net we have some fused/merged projections, and vLLM fuses them by concatenating the weight tensors. That assumes plain weight tensors which are concatenable, but the NVFP4 format stores weights in weight_packed (4-bit packed) form, so the fused weights come out as garbage. I am currently trying to write a fix; if I succeed in getting it working I will submit a PR to the vllm repo as well.
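To make the failure mode above concrete, here is a toy illustration (not vLLM's or NVFP4's actual layout, just an assumed two-values-per-byte packing) of why element-level operations on packed 4-bit buffers are not equivalent to the same operations on the unpacked values: once two elements share a byte, element boundaries no longer align with storage boundaries.

```python
import numpy as np

def pack_int4(vals):
    """Pack pairs of 4-bit values (0..15) into bytes, low nibble first."""
    vals = np.asarray(vals, dtype=np.uint8)
    assert vals.size % 2 == 0, "toy packer needs an even element count"
    return (vals[0::2] | (vals[1::2] << 4)).astype(np.uint8)

def unpack_int4(packed):
    out = np.empty(packed.size * 2, dtype=np.uint8)
    out[0::2] = packed & 0x0F
    out[1::2] = packed >> 4
    return out

vals = np.array([1, 2, 3, 4, 5, 6], dtype=np.uint8)
packed = pack_int4(vals)  # 6 values -> 3 bytes

# The round trip itself is lossless:
assert np.array_equal(unpack_int4(packed), vals)

# But "take the first 3 elements" cannot be expressed on the packed
# buffer: 3 values span 1.5 bytes, so any byte-level slice yields the
# wrong element count. The assumption "one tensor element == one
# storage element" that naive weight fusing relies on no longer holds.
assert unpack_int4(packed[:2]).size == 4  # 4 values, not 3
```

The same mismatch applies to concatenation-based fusing when group scales and packing boundaries enter the picture, which is consistent with the garbage weights described above.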
If we skip quantizing the linear attn layers, won't this issue be resolved?
Do you mind adding a test similar to the tests in this folder: https://github.com/vllm-project/llm-compressor/tree/main/tests/llmcompressor/modeling
Yes, for those layers it does not matter.
Sure, will do it!
Force-pushed 642ba83 to d030961
@dsikka Tests have been added.
Review Request
It came to my attention that the MTP modules are dropped from the quant. I am away until Sunday but can fix it then.
@JartX I would generally not rely on such specific behavior being maintained when going from a non-quantized to a quantized model. If something like that is needed, I would start by doing AWQ or GPTQ with your use cases as a significant part of the calibration data.
@HDCharles
That's a use-case-specific issue; in practice it's assumed that there will be some quantization loss, and there are a variety of techniques you can use to quantize your model so it overperforms on your specific use case. As for this being what you are doing: it looks like you're calibrating on ultrachat and nemotron. Are those datasets that conform to your specific JSON language?
The new script @HDCharles
I have been investigating an issue that appears to affect all quantized models capable of running on vLLM main. Specifically, all quantized versions fail to generate structured output when the input includes images, regardless of the quantization technique used. This has been observed across the following formats:

- LlmCompressor-AWQ
- LlmCompressor-GPTQ
- GPTQ
- FP8

The failure occurs consistently during the generation phase when multimodal (image) data is present in the prompt. https://huggingface.co/cyankiwi/Qwen3.5-35B-A3B-AWQ-4bit The last quant code: Many thanks for your time.

UPDATED: I was able to successfully extract the invoice data
Hello! I have uploaded NVFP4 quants made with this PR to HuggingFace. I haven't seen any performance or accuracy issues myself, but you can test with my model from that link if you have the time.
- Remove unnecessary disable_onloading() wrapper in qwen3_5_moe.py
- Add hasattr fallback for _no_split_modules in get_no_split_params
- Use public match_named_modules API instead of private _match_name
@Sehyo Your model works well with images, unlike the others. The structured output was missing some data, but after using the instruction dataset with JSON, as recommended by the other user, it worked correctly. I also want to mention that your PR and the
@Sehyo I would switch it to the W4A16 scheme; the group size is so it works with Exllama on my RDNA3.
@Sehyo Hi, I’ve encountered a couple of issues while running a modified version of your example code.

Modification to the quantization script:

```python
scheme_0 = FP8_DYNAMIC
scheme_0["targets"] = ["re:.*self_attn.o_proj", "re:.*linear_attn.in_proj_qkv", "re:.*linear_attn.in_proj_z", "re:.*linear_attn.out_proj"]
scheme_1 = NVFP4
scheme_1["targets"] = ["re:.*self_attn.(q|k|v)_proj", "re:.*mlp.experts.*.*_proj"]
ignore = ["re:.*lm_head", "re:visual.*", "re:model.visual.*", "re:.*mlp.gate$", "re:.*norm.*", "re:.*mlp.shared_expert_gate$", "re:.*mtp.*", "re:.*conv1d.*", "re:.*in_proj_a*", "re:.*in_proj_b*", "re:.*in_proj_c*"]
recipe = QuantizationModifier(
    config_groups={"group_0": scheme_0, "group_1": scheme_1}, ignore=ignore
)
```

Expected behavior:
Result: Only
Another issue: the exported tokenizer metadata appears to use an unexpected class:
There was, and still may be, an issue using mixed precision with NVFP4 in vLLM. Be aware of that, as it may be what's occurring here. I closed my PR, as I didn't see yours @Sehyo. Your code was very close to mine, and your MTP handling is solid for peeps who turn it on. Thanks for submitting this.
@phaelon74 Thanks for the information! I’ll open a separate issue to discuss this, since it seems unrelated to this PR. I wonder if this is specific to the new

Edit: I found the issue. Turns out the regex wasn’t matching in my script.

```python
scheme_0 = FP8_DYNAMIC
scheme_0["targets"] = [
    "re:.*self_attn.o_proj$",
    "re:.*linear_attn.in_proj_qkv$",
    "re:.*linear_attn.in_proj_z$",
    "re:.*linear_attn.out_proj$",
]
scheme_1 = NVFP4
scheme_1["targets"] = [
    "re:.*self_attn.(q|k|v)_proj$",
    "re:.*mlp.experts.*.*_proj$",
]
ignore = ["re:.*lm_head", "re:visual.*", "re:model.visual.*", "re:.*mlp.gate$", "re:.*norm.*", "re:.*mlp.shared_expert_gate$", "re:.*mtp.*", "re:.*conv1d.*", "re:.*in_proj_a+", "re:.*in_proj_b+", "re:.*in_proj_c+"]
recipe = QuantizationModifier(
    config_groups={"group_0": scheme_0, "group_1": scheme_1}, ignore=ignore
)
```
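As a side note on why the `$` anchors in these patterns matter, here is a minimal sketch assuming search-style regex matching (not necessarily llm-compressor's exact matching internals): without the anchor, a router-ignore pattern like `.*mlp.gate` also swallows the expert projections you want to quantize.

```python
import re

names = [
    "model.layers.0.mlp.gate",       # MoE router -- should be ignored
    "model.layers.0.mlp.gate_proj",  # expert projection -- should be quantized
]

unanchored = re.compile(r".*mlp.gate")
anchored = re.compile(r".*mlp.gate$")

# Without "$", the router pattern matches both module names:
assert all(unanchored.search(n) for n in names)

# With "$", only the router module matches, leaving gate_proj free
# to be picked up by the quantization targets:
assert [bool(anchored.search(n)) for n in names] == [True, False]
```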
Overall this looks fine, but I don't quite understand why we need an updated regex pattern, _update_config_expanded_ignore, or _graft_extra_weights. I think generally, if we want to expand regex mapping, that should be done in a follow-up PR as it is not specific to Qwen3.5.
I am able to generate quantized checkpoints without this.
```python
# by regex (e.g. MoE router modules that aren't nn.Linear).
# Store expanded names on the model so the save wrapper can ensure
# they appear in config.json.
regex_patterns = [p for p in self.ignore if p.startswith("re:")]
```
Can you explain why you need this?
I did not have this in mine, and mine quanted and loaded successfully in vLLM, so I would love to know as well.
The quality checks have failed. Please run
Graft extra weights is for adding the MTP weights back in, as they get dropped.
@Sehyo I think we want to do this at the end when we're saving the checkpoint, not in the middle of calibration, as it does not impact quantization. Do you mind also resolving the quality issues?
The quality checks have failed. Please run
I noticed your code uses `from transformers import AutoProcessor, AutoTokenizer, Qwen3_5MoeForConditionalGeneration`, which requires transformers>=5.2.0. However, the `from llmcompressor import oneshot` code indicates that the latest version of llmcompressor depends on transformers>=4.56.1,<=4.57.6.
Hi @Sehyo I am going to break this PR up and land it in smaller pieces, as some of this functionality is now out of date. Thank you for the contribution!
Apologies for this ask @dsikka, but can you map it out please? I am having to use my PR to make my Qwen3.5 quants work, so it would be nice to know which PRs you will fold into the implementation, so I know when they land.
Summary

- Adds a `CalibrationQwen3_5MoeSparseMoeBlock` calibration module that unfuses Qwen3.5's 3D fused expert parameters into individual `Qwen3_5MoeMLP` modules with `nn.Linear` layers, enabling NVFP4 quantization of expert weights
- Registers the module in `modeling/__init__.py`
- Adds an example script demonstrating usage on `Qwen/Qwen3.5-397B-A17B`

Details

Qwen3.5 MoE (`Qwen3_5MoeSparseMoeBlock`) stores all expert weights in fused 3D `nn.Parameter` tensors (`gate_up_proj`: [num_experts, 2*intermediate, hidden], `down_proj`: [num_experts, hidden, intermediate]). The calibration module unfuses these into individual MLP modules so `targets="Linear"` can match and quantize them.

The implementation follows the same pattern as `CalibrateQwen3VLMoeTextSparseMoeBlock` with `is_permanent=True`, and includes `disable_onloading()` for safe CPU access to offloaded parameters on large models.
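The unfusing idea can be sketched as follows. This is a minimal, self-contained illustration with toy sizes, not the PR's actual code: the half-split layout of `gate_up_proj` (gate in the first half, up in the second) and the SiLU-gated forward are assumptions for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

num_experts, hidden, intermediate = 4, 8, 16

# Fused 3D parameters in the layout described above (toy sizes):
gate_up_proj = torch.randn(num_experts, 2 * intermediate, hidden)
down_proj = torch.randn(num_experts, hidden, intermediate)

# Unfuse each expert's slab into ordinary nn.Linear layers, so that
# targets="Linear" style matching can reach each projection.
experts = nn.ModuleList()
for e in range(num_experts):
    mlp = nn.Module()
    mlp.gate_proj = nn.Linear(hidden, intermediate, bias=False)
    mlp.up_proj = nn.Linear(hidden, intermediate, bias=False)
    mlp.down_proj = nn.Linear(intermediate, hidden, bias=False)
    with torch.no_grad():
        # assumed layout: first half of the slab is gate, second is up
        mlp.gate_proj.weight.copy_(gate_up_proj[e, :intermediate, :])
        mlp.up_proj.weight.copy_(gate_up_proj[e, intermediate:, :])
        mlp.down_proj.weight.copy_(down_proj[e])
    experts.append(mlp)

# A per-expert forward now flows through plain Linear modules:
x = torch.randn(2, hidden)
m = experts[0]
y = m.down_proj(F.silu(m.gate_proj(x)) * m.up_proj(x))
assert y.shape == (2, hidden)
```

Note that `nn.Linear(in, out)` stores its weight as `[out, in]`, which is why the per-expert slices above drop into the Linear weights without transposition under the stated fused layout.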